{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "aad_1s7ybD5o"
   },
   "source": [
    "# `RoBERTa` --> `Longformer`: build a \"long\" version of pretrained models\n",
    "\n",
    "This notebook replicates the procedure descriped in the [Longformer paper](https://arxiv.org/abs/2004.05150) to train a Longformer model starting from the RoBERTa checkpoint. The same procedure can be applied to build the \"long\" version of other pretrained models as well. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "BieXv0YUd7NF"
   },
   "source": [
    "### Data, libraries, and imports\n",
    "Our procedure requires a corpus for pretraining. For demonstration, we will use Wikitext103; a corpus of 100M tokens from wikipedia articles. Depending on your application, consider using a different corpus that is a better match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7AZZ4VmKSwvH"
   },
   "outputs": [],
   "source": [
    "!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\n",
    "!unzip wikitext-103-raw-v1.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "o3yjIYKXw3rL"
   },
   "outputs": [],
   "source": [
    "!pip install transformers==3.0.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "U0NnMMl6wy7Q"
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "import math\n",
    "import copy\n",
    "import torch\n",
    "from dataclasses import dataclass, field\n",
    "from transformers import RobertaForMaskedLM, RobertaTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer\n",
    "from transformers import TrainingArguments, HfArgumentParser\n",
    "from transformers.modeling_longformer import LongformerSelfAttention\n",
    "\n",
    "logger = logging.getLogger(__name__)\n",
    "logging.basicConfig(level=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "xgoNVJYUbD59"
   },
   "source": [
    "### RobertaLong\n",
    "\n",
    "`RobertaLongForMaskedLM` represents the \"long\" version of the `RoBERTa` model. It replaces `BertSelfAttention` with `RobertaLongSelfAttention`, which is a thin wrapper around `LongformerSelfAttention`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "J9EBISkRxPjO"
   },
   "outputs": [],
   "source": [
    "class RobertaLongSelfAttention(LongformerSelfAttention):\n",
    "    def forward(\n",
    "        self,\n",
    "        hidden_states,\n",
    "        attention_mask=None,\n",
    "        head_mask=None,\n",
    "        encoder_hidden_states=None,\n",
    "        encoder_attention_mask=None,\n",
    "        output_attentions=False,\n",
    "    ):\n",
    "        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)\n",
    "\n",
    "\n",
    "class RobertaLongForMaskedLM(RobertaForMaskedLM):\n",
    "    def __init__(self, config):\n",
    "        super().__init__(config)\n",
    "        for i, layer in enumerate(self.roberta.encoder.layer):\n",
    "            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`\n",
    "            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4LRZa5s1bD6E"
   },
   "source": [
    "Starting from the `roberta-base` checkpoint, the following function converts it into an instance of `RobertaLong`. It makes the following changes:\n",
    "\n",
    "- extend the position embeddings from `512` positions to `max_pos`. In Longformer, we set `max_pos=4096`\n",
    "\n",
    "- initialize the additional position embeddings by copying the embeddings of the first `512` positions. This initialization is crucial for the model performance (check table 6 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) for performance without this initialization)\n",
    "\n",
    "- replaces `modeling_bert.BertSelfAttention` objects with `modeling_longformer.LongformerSelfAttention` with a attention window size `attention_window`\n",
    "\n",
    "The output of this function works for long documents even without pretraining. Check tables 6 and 11 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) to get a sense of the expected performance of this model before pretraining."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "-m4A_ttixPuf"
   },
   "outputs": [],
   "source": [
    "def create_long_model(save_model_to, attention_window, max_pos):\n",
    "    model = RobertaForMaskedLM.from_pretrained('roberta-base')\n",
    "    tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=max_pos)\n",
    "    config = model.config\n",
    "\n",
    "    # extend position embeddings\n",
    "    tokenizer.model_max_length = max_pos\n",
    "    tokenizer.init_kwargs['model_max_length'] = max_pos\n",
    "    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape\n",
    "    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2\n",
    "    config.max_position_embeddings = max_pos\n",
    "    assert max_pos > current_max_pos\n",
    "    # allocate a larger position embedding matrix\n",
    "    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)\n",
    "    # copy position embeddings over and over to initialize the new position embeddings\n",
    "    k = 2\n",
    "    step = current_max_pos - 2\n",
    "    while k < max_pos - 1:\n",
    "        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]\n",
    "        k += step\n",
    "    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed\n",
    "    model.roberta.embeddings.position_ids.data = torch.tensor([i for i in range(max_pos)]).reshape(1, max_pos)\n",
    "\n",
    "    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`\n",
    "    config.attention_window = [attention_window] * config.num_hidden_layers\n",
    "    for i, layer in enumerate(model.roberta.encoder.layer):\n",
    "        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)\n",
    "        longformer_self_attn.query = layer.attention.self.query\n",
    "        longformer_self_attn.key = layer.attention.self.key\n",
    "        longformer_self_attn.value = layer.attention.self.value\n",
    "\n",
    "        longformer_self_attn.query_global = copy.deepcopy(layer.attention.self.query)\n",
    "        longformer_self_attn.key_global = copy.deepcopy(layer.attention.self.key)\n",
    "        longformer_self_attn.value_global = copy.deepcopy(layer.attention.self.value)\n",
    "\n",
    "        layer.attention.self = longformer_self_attn\n",
    "\n",
    "    logger.info(f'saving model to {save_model_to}')\n",
    "    model.save_pretrained(save_model_to)\n",
    "    tokenizer.save_pretrained(save_model_to)\n",
    "    return model, tokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "PkbQvhOMbD6L"
   },
   "source": [
    "Pretraining on Masked Language Modeling (MLM) doesn't update the global projection layers. After pretraining, the following function copies `query`, `key`, `value` to their global counterpart projection matrices.\n",
    "For more explanation on \"local\" vs. \"global\" attention, please refer to the documentation [here](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CO3MoEgCxP9W"
   },
   "outputs": [],
   "source": [
    "def copy_proj_layers(model):\n",
    "    for i, layer in enumerate(model.roberta.encoder.layer):\n",
    "        layer.attention.self.query_global = copy.deepcopy(layer.attention.self.query)\n",
    "        layer.attention.self.key_global = copy.deepcopy(layer.attention.self.key)\n",
    "        layer.attention.self.value_global = copy.deepcopy(layer.attention.self.value)\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0KjI9f8BbD6S"
   },
   "source": [
    "### Pretrain and Evaluate on masked language modeling (MLM)\n",
    "\n",
    "The following function pretrains and evaluates a model on MLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "wC2sddeLyVhA"
   },
   "outputs": [],
   "source": [
    "def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):\n",
    "    val_dataset = TextDataset(tokenizer=tokenizer,\n",
    "                              file_path=args.val_datapath,\n",
    "                              block_size=tokenizer.max_len)\n",
    "    if eval_only:\n",
    "        train_dataset = val_dataset\n",
    "    else:\n",
    "        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')\n",
    "        train_dataset = TextDataset(tokenizer=tokenizer,\n",
    "                                    file_path=args.train_datapath,\n",
    "                                    block_size=tokenizer.max_len)\n",
    "\n",
    "    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)\n",
    "    trainer = Trainer(model=model, args=args, data_collator=data_collator,\n",
    "                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True,)\n",
    "\n",
    "    eval_loss = trainer.evaluate()\n",
    "    eval_loss = eval_loss['eval_loss']\n",
    "    logger.info(f'Initial eval bpc: {eval_loss/math.log(2)}')\n",
    "    \n",
    "    if not eval_only:\n",
    "        trainer.train(model_path=model_path)\n",
    "        trainer.save_model()\n",
    "\n",
    "        eval_loss = trainer.evaluate()\n",
    "        eval_loss = eval_loss['eval_loss']\n",
    "        logger.info(f'Eval bpc after pretraining: {eval_loss/math.log(2)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AqY_Hg5HbD6a"
   },
   "source": [
    "**Training hyperparameters**\n",
    "\n",
    "- Following RoBERTa pretraining setting, we set number of tokens per batch to be `2^18` tokens. Changing this number might require changes in the lr, lr-scheudler, #steps and #warmup steps. Therefor, it is a good idea to keep this number constant.\n",
    "\n",
    "- Note that: `#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen`\n",
    "   \n",
    "- In [the paper](https://arxiv.org/pdf/2004.05150.pdf), we train for 65k steps, but 3k is probably enough (check table 6)\n",
    "\n",
    "- **Important note**: The lr-scheduler in [the paper](https://arxiv.org/pdf/2004.05150.pdf) is polynomial_decay with power 3 over 65k steps. To train for 3k steps, use a constant lr-scheduler (after warmup). Both lr-scheduler are not supported in HF trainer, and at least **constant lr-scheduler** will need to be added. \n",
    "\n",
    "- Pretraining will take 2 days on 1 x 32GB GPU with fp32. Consider using fp16 and using more gpus to train faster (if you increase `#gpus`, reduce `gradient_accumulation` to maintain `#tokens/batch` as mentioned earlier).\n",
    "\n",
    "- As a demonstration, this notebook is training on wikitext103 but wikitext103 is rather small that it takes 7 epochs to train for 3k steps Consider doing a single epoch on a larger dataset (800M tokens) instead.\n",
    "\n",
    "- Set #gpus using `CUDA_VISIBLE_DEVICES`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Zl_hDDlryVo2"
   },
   "outputs": [],
   "source": [
    "@dataclass\n",
    "class ModelArgs:\n",
    "    attention_window: int = field(default=512, metadata={\"help\": \"Size of attention window\"})\n",
    "    max_pos: int = field(default=4096, metadata={\"help\": \"Maximum position\"})\n",
    "\n",
    "parser = HfArgumentParser((TrainingArguments, ModelArgs,))\n",
    "\n",
    "\n",
    "training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[\n",
    "    '--output_dir', 'tmp',\n",
    "    '--warmup_steps', '500',\n",
    "    '--learning_rate', '0.00003',\n",
    "    '--weight_decay', '0.01',\n",
    "    '--adam_epsilon', '1e-6',\n",
    "    '--max_steps', '3000',\n",
    "    '--logging_steps', '500',\n",
    "    '--save_steps', '500',\n",
    "    '--max_grad_norm', '5.0',\n",
    "    '--per_gpu_eval_batch_size', '8',\n",
    "    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32\n",
    "    '--gradient_accumulation_steps', '32',\n",
    "    '--evaluate_during_training',\n",
    "    '--do_train',\n",
    "    '--do_eval',\n",
    "])\n",
    "training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'\n",
    "training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'\n",
    "\n",
    "# Choose GPU\n",
    "import os\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "wi5ZIKcsbD6e"
   },
   "source": [
    "### Put it all together\n",
    "\n",
    "1) Evaluating `roberta-base` on MLM to establish a baseline. Validation `bpc` = `2.536` which is higher than the `bpc` values in table 6 [here](https://arxiv.org/pdf/2004.05150.pdf) because wikitext103 is harder than our pretraining corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "referenced_widgets": [
      "b59172f252764a3cae42171d649fa160"
     ]
    },
    "colab_type": "code",
    "id": "JiKj7D1c1ovy",
    "outputId": "dd430ba8-c69b-4110-e384-550a7b4875e6"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at /home/beltagy/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690\n",
      "INFO:transformers.configuration_utils:Model config RobertaConfig {\n",
      "  \"architectures\": [\n",
      "    \"RobertaForMaskedLM\"\n",
      "  ],\n",
      "  \"attention_probs_dropout_prob\": 0.1,\n",
      "  \"bos_token_id\": 0,\n",
      "  \"eos_token_id\": 2,\n",
      "  \"hidden_act\": \"gelu\",\n",
      "  \"hidden_dropout_prob\": 0.1,\n",
      "  \"hidden_size\": 768,\n",
      "  \"initializer_range\": 0.02,\n",
      "  \"intermediate_size\": 3072,\n",
      "  \"layer_norm_eps\": 1e-05,\n",
      "  \"max_position_embeddings\": 514,\n",
      "  \"model_type\": \"roberta\",\n",
      "  \"num_attention_heads\": 12,\n",
      "  \"num_hidden_layers\": 12,\n",
      "  \"pad_token_id\": 1,\n",
      "  \"type_vocab_size\": 1,\n",
      "  \"vocab_size\": 50265\n",
      "}\n",
      "\n",
      "INFO:transformers.modeling_utils:loading weights file https://cdn.huggingface.co/roberta-base-pytorch_model.bin from cache at /home/beltagy/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e\n",
      "INFO:transformers.modeling_utils:Weights of RobertaForMaskedLM not initialized from pretrained model: ['lm_head.decoder.bias']\n",
      "INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /home/beltagy/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b\n",
      "INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /home/beltagy/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda\n",
      "INFO:__main__:Evaluating roberta-base (seqlen: 512) for refernece ...\n",
      "INFO:filelock:Lock 140599059908048 acquired on wikitext-103-raw/cached_lm_RobertaTokenizerFast_510_wiki.valid.raw.lock\n",
      "INFO:transformers.data.datasets.language_modeling:Loading features from cached file wikitext-103-raw/cached_lm_RobertaTokenizerFast_510_wiki.valid.raw [took 0.008 s]\n",
      "INFO:filelock:Lock 140599059908048 released on wikitext-103-raw/cached_lm_RobertaTokenizerFast_510_wiki.valid.raw.lock\n",
      "INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.\n",
      "INFO:transformers.trainer:***** Running Evaluation *****\n",
      "INFO:transformers.trainer:  Num examples = 489\n",
      "INFO:transformers.trainer:  Batch size = 8\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ac41af48affc41fea98bd4f611ee3e8f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=62.0, style=ProgressStyle(description_wi…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:Initial eval bpc: 2.536170097033518\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "{\"eval_loss\": 1.7579391521792258, \"step\": null}\n"
     ]
    }
   ],
   "source": [
    "roberta_base = RobertaForMaskedLM.from_pretrained('roberta-base')\n",
    "roberta_base_tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')\n",
    "logger.info('Evaluating roberta-base (seqlen: 512) for refernece ...')\n",
    "pretrain_and_evaluate(training_args, roberta_base, roberta_base_tokenizer, eval_only=True, model_path=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MBXU3r69bD6l"
   },
   "source": [
    "2) As descriped in `create_long_model`, convert a `roberta-base` model into `roberta-base-4096` which is an instance of `RobertaLong`, then save it to the disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "c7tcsSfZ1-b9"
   },
   "outputs": [],
   "source": [
    "model_path = f'{training_args.output_dir}/roberta-base-{model_args.max_pos}'\n",
    "if not os.path.exists(model_path):\n",
    "    os.makedirs(model_path)\n",
    "\n",
    "logger.info(f'Converting roberta-base into roberta-base-{model_args.max_pos}')\n",
    "model, tokenizer = create_long_model(\n",
    "    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "JiftMH3-zPUS"
   },
   "source": [
    "3) Load `roberta-base-4096` from the disk. This model works for long sequences even without pretraining. If you don't want to pretrain, you can stop here and start finetuning your `roberta-base-4096` on downstream tasks 🎉🎉🎉"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "H8vNeYdrzMd2"
   },
   "outputs": [],
   "source": [
    "logger.info(f'Loading the model from {model_path}')\n",
    "tokenizer = RobertaTokenizerFast.from_pretrained(model_path)\n",
    "model = RobertaLongForMaskedLM.from_pretrained(model_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kS0Np2F4bD6p"
   },
   "source": [
    "4) Pretrain `roberta-base-4096` for `3k` steps, each steps has `2^18` tokens. Notes: \n",
    "\n",
    "- The `training_args.max_steps = 3 ` is just for the demo. **Remove this line for the actual training**\n",
    "\n",
    "- Training for `3k` steps will take 2 days on a single 32GB gpu with `fp32`. Consider using `fp16` and more gpus to train faster. \n",
    "\n",
    "- Tokenizing the training data the first time is going to take 5-10 minutes.\n",
    "\n",
    "- MLM validation `bpc` **before** pretraining: **2.652**, a bit worse than the **2.536** of `roberta-base`. As discussed in [the paper](https://arxiv.org/pdf/2004.05150.pdf) this is expected because the model didn't learn yet to work with the sliding window attention. \n",
    "\n",
    "- MLM validation `bpc` after pretraining for a few number of steps: **2.628**. It is quickly getting better. By 3k steps, it should be better than the **2.536** of `roberta-base`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {
     "referenced_widgets": [
      "f2d298034ac44de397e262fd9902f4cf",
      "dbff0a67283b41c7acc81c9bc7b3c3d2",
      "b8cfa9900fe745f58430dfbd1e86f25e",
      "21cf3ef186aa4895aa9741376834a84e"
     ]
    },
    "colab_type": "code",
    "id": "SHD7QMUWbD6q",
    "outputId": "684d79ed-2237-4c7d-f0b9-dd6ba6b45d00"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:Pretraining roberta-base-4096 ... \n",
      "INFO:filelock:Lock 140598563391376 acquired on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.valid.raw.lock\n",
      "INFO:transformers.data.datasets.language_modeling:Loading features from cached file wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.valid.raw [took 0.017 s]\n",
      "INFO:filelock:Lock 140598563391376 released on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.valid.raw.lock\n",
      "INFO:__main__:Loading and tokenizing training data is usually slow: wikitext-103-raw/wiki.train.raw\n",
      "INFO:filelock:Lock 140599059908048 acquired on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.train.raw.lock\n",
      "INFO:transformers.data.datasets.language_modeling:Loading features from cached file wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.train.raw [took 5.838 s]\n",
      "INFO:filelock:Lock 140599059908048 released on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.train.raw.lock\n",
      "INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.\n",
      "INFO:transformers.trainer:***** Running Evaluation *****\n",
      "INFO:transformers.trainer:  Num examples = 61\n",
      "INFO:transformers.trainer:  Batch size = 8\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0dfddf7d39d1402bb83101f580a858b7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=8.0, style=ProgressStyle(description_wid…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:Initial eval bpc: 2.6521989344600327\n",
      "INFO:transformers.trainer:***** Running training *****\n",
      "INFO:transformers.trainer:  Num examples = 29114\n",
      "INFO:transformers.trainer:  Num Epochs = 1\n",
      "INFO:transformers.trainer:  Instantaneous batch size per device = 2\n",
      "INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64\n",
      "INFO:transformers.trainer:  Gradient Accumulation steps = 32\n",
      "INFO:transformers.trainer:  Total optimization steps = 3\n",
      "INFO:transformers.trainer:  Starting fine-tuning.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "{\"eval_loss\": 1.8383642137050629, \"step\": null}\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d7b45553807e4ec1b6dc691ebcf27a8b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Epoch', max=1.0, style=ProgressStyle(description_width='i…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bbed8d4111f548d8a51da4b794c079ac",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Iteration', max=14557.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:transformers.trainer:\n",
      "\n",
      "Training completed. Do not forget to share your model on huggingface.co/models =)\n",
      "\n",
      "\n",
      "INFO:transformers.trainer:Saving model checkpoint to tmp\n",
      "INFO:transformers.configuration_utils:Configuration saved in tmp/config.json\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:transformers.modeling_utils:Model weights saved in tmp/pytorch_model.bin\n",
      "INFO:transformers.trainer:***** Running Evaluation *****\n",
      "INFO:transformers.trainer:  Num examples = 61\n",
      "INFO:transformers.trainer:  Batch size = 8\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1fced5a44b4145ebb2b3ea2f40b06b2c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=8.0, style=ProgressStyle(description_wid…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:Eval bpc after pretraining: 2.6277886199054827\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "{\"eval_loss\": 1.8214442729949951, \"epoch\": 0.008793020539946418, \"step\": 4}\n"
     ]
    }
   ],
   "source": [
    "logger.info(f'Pretraining roberta-base-{model_args.max_pos} ... ')\n",
    "\n",
    "training_args.max_steps = 3   ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<\n",
    "\n",
    "pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6GOc0_6SbD6u"
   },
   "source": [
    "5) Copy global projection layers. MLM pretraining doesn't train global projections, so we need to call `copy_proj_layers` to copy the local projection layers to the global ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "obupoA0FbD6v",
    "outputId": "059fa657-a3f4-401d-8230-a9997ffaed67"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:Copying local projection layers into global projection layers ... \n",
      "INFO:__main__:Saving model to tmp/roberta-base-4096\n",
      "INFO:transformers.configuration_utils:Configuration saved in tmp/roberta-base-4096/config.json\n",
      "INFO:transformers.modeling_utils:Model weights saved in tmp/roberta-base-4096/pytorch_model.bin\n"
     ]
    }
   ],
   "source": [
    "logger.info(f'Copying local projection layers into global projection layers ... ')\n",
    "model = copy_proj_layers(model)\n",
    "logger.info(f'Saving model to {model_path}')\n",
    "model.save_pretrained(model_path)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jtZ2sfWkbD60"
   },
   "source": [
    "🎉🎉🎉🎉 **DONE**. 🎉🎉🎉🎉\n",
    "\n",
    "`model` can now be used for finetuning on downstream tasks after loading it from the disk. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NOBmcHTj3NPz"
   },
   "outputs": [],
   "source": [
    "logger.info(f'Loading the model from {model_path}')\n",
    "tokenizer = RobertaTokenizerFast.from_pretrained(model_path)\n",
    "model = RobertaLongForMaskedLM.from_pretrained(model_path)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "convert_model_to_long.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
